System and methods for finding hidden topics of documents and preference ranking documents

ABSTRACT

Systems and methods are disclosed to perform preference learning on a set of documents, including receiving raw input features from the set of documents stored on a data storage device; generating polynomial combinations from the raw input features; generating one or more parameters; applying the parameters to one or more classifiers to generate outputs; determining a loss function and parameter gradients and updating the parameters; determining one or more sparse regularizing terms and updating the parameters; and expressing that one document is preferred over another in a search query and retrieving one or more documents responsive to the search query.

This application claims priority to Provisional Application Ser. Nos. 61/360,599 filed Jul. 1, 2010 and 61/381,310 filed Sep. 9, 2010, the contents of which are incorporated by reference.

BACKGROUND

The present application relates to systems and methods for classifying documents. Learning preferences among a set of objects (e.g. documents) given another object as query is a central task of information retrieval and text mining. One of the most natural frameworks for this task is pairwise preference learning, expressing that one document is preferred over another given the query. Most existing methods learn the preference or relevance function by assigning a real valued score to a feature vector describing a (query, object) pair. This feature vector normally includes a small number of hand-crafted features, such as the BM25 scores for the title or the whole text, instead of the very natural raw features. A drawback of using hand-crafted features is that they are often expensive and specific to datasets, requiring domain knowledge in preprocessing. In contrast, the raw features are easily available, and carry strong semantic information (such as word features in text mining).

Polynomial models (using combinations of features as new features) on raw features are powerful and are easy to acquire in many preference learning problems. However, there are usually a very large number of features, which makes storing and learning difficult. For example, consider a basic model which uses the raw word features under the supervised pairwise preference learning framework and considers feature relationships in the model. In this model, let D be the dictionary size, i.e. the size of the query and document feature set. Given a query q∈R^(D) and a document d∈R^(D), the relevance score between q and d is modeled as:

$\begin{matrix}{f(q,d) = q^{T}Wd = \sum\limits_{i,j} W_{ij}\,\Phi(q_{i},d_{j}),} & (1)\end{matrix}$
where Φ(q_(i),d_(j))=q_(i)·d_(j) and W_(ij) models the relationship/correlation between the i^(th) query feature q_(i) and the j^(th) document feature d_(j). This is essentially a linear model with pairwise features Φ(.,.), and the parameter matrix W∈R^(D×D) is learned from labeled data. Compared to most of the existing models, the capacity of this model is very large because of the D² free parameters, which can carefully model the relationship between each pair of words. From a semantic point of view, a notable superiority of this model is that it can capture synonymy and polysemy as it looks at all possible cross terms, and can be tuned directly for the task of interest.

Although it is very powerful, the basic model in Eq. (1) suffers from the following weaknesses, which hinder its wide application:

1. Memory requirement: Given a dictionary size D, the model requires a large amount of memory to store the W matrix, whose size is quadratic in D. When D=10,000, storing W needs nearly 1 GB of RAM (assuming double-precision entries); when D=30,000, storing W requires about 8 GB of RAM.

2. Generalization ability: Given D² free parameters (entries of W), a limited number of training samples can easily lead to overfitting. For a dictionary of size D=10,000, there are D²=10⁸ free parameters to be estimated, which is far too many for small corpora.
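The quadratic growth described in the two items above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming 8-byte double-precision entries and decimal gigabytes; the function name `dense_w_bytes` is a hypothetical helper introduced only for illustration, and the figures quoted in the text are rounded.

```python
def dense_w_bytes(dictionary_size: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to store a dense D x D matrix of fixed-width entries."""
    return dictionary_size * dictionary_size * bytes_per_entry


for d in (10_000, 30_000):
    n_params = d * d
    gigabytes = dense_w_bytes(d) / 1e9  # decimal gigabytes
    print(f"D={d}: {n_params} parameters, {gigabytes:.1f} GB")
```

Tripling D multiplies both the parameter count and the storage by nine, which is why a dense W quickly becomes impractical.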

Recently, researchers have found that raw features (e.g. words for text retrieval) and their pairwise features, which describe relationships between two raw features (e.g. word synonymy or polysemy), can greatly improve retrieval precision. However, most existing methods cannot scale up to problems with many raw features (e.g. the English vocabulary), due to the prohibitive computational cost of learning and the memory required to store a quadratic number of parameters.

Since such models are not practical, present systems often create a smaller feature space by dimension reduction technologies such as PCA. When raw features are used, polynomial models are avoided. When polynomial models are used, various approaches can be used, including:

1. Sparse model: remove the parameters that are less important. However, empirical studies on very large sparse models are lacking.

2. Low rank approximation: try to decompose the relationship matrix.

3. Hashing: try to put the large number of parameters into a smaller number of bins.

In a related trend, unsupervised dimension reduction methods, like Latent Semantic Analysis (LSA), have been widely used in the field of text mining for hidden topic detection. The key idea of LSA is to learn a projection matrix that maps the high dimensional vector space representations of documents to a lower dimensional latent space, i.e. the so-called latent topic space. However, LSA cannot provide a clear and compact topic-word relationship because LSA represents each topic as a weighted combination of all words in the vocabulary.

Two existing models are closely related to LSA and have been used to find compact topic-word relationships from text data. Latent Dirichlet Allocation (LDA) provides a generative probabilistic model from a Bayesian perspective to search for topics. LDA can provide the distribution of words given a topic and hence rank the words for a topic. However, LDA can only handle a small number of hidden topics. Sparse coding, another unsupervised learning algorithm, learns basis functions which capture higher-level features in the data and has been successfully applied in image processing and speech recognition. Sparse coding can provide a compact representation from documents to topics, but cannot provide a compact representation from topics to words, since topics are learned basis functions associated with all words.

SUMMARY

In one aspect, a method to perform preference learning on a set of documents includes receiving raw input features from the set of documents stored on a data storage device; generating polynomial combinations from the raw input features; generating one or more parameters W; applying W to one or more classifiers to generate outputs; determining a loss function and parameter gradients and updating W; determining one or more sparse regularizing terms and updating W; and expressing that one document is preferred over another in a search query and retrieving one or more documents responsive to the search query.

In another aspect, a method to apply a query to a set of documents includes reconstructing a document term matrix X∈R^(N×M) by minimizing reconstruction errors with min ∥X−UA∥, wherein A is a fixed projection matrix and U is a column orthogonal matrix; determining a loss function and parameter gradients to generate U; fixing U while determining the loss function and sparse regularization constraints on the projection matrix A; generating parameter coefficients and generating a sparse projection matrix A; and generating a Sparse Latent Semantic Analysis (Sparse LSA) model and applying the model to a set of documents and displaying documents matching a query.

Advantages of the preferred embodiments may include one or more of the following. Sparsity is applied to the model, which is achieved by an l₁ regularization with an efficient online learning method. The method achieves good performance with fast convergence, while remaining sparse during training (small memory consumption), on a model with hundreds of millions of parameters. Extended models can learn group structure or even hierarchical structure of the words using group lasso type regularization. Prior knowledge on the structure of W can also be imposed. In short, the system is highly efficient, fast and requires minimal computational resources.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an exemplary process for generating preference learning models.

FIG. 2 shows an exemplary “sparse latent semantic analysis” learning method.

FIG. 3 shows an exemplary computer that can analyze documents.

DESCRIPTION

1) Learning Preferences Upon Millions of Features Over Sparsity

FIG. 1 shows an exemplary process for generating preference learning models. This method, also known as Learning Preferences upon Millions of Features Over Sparsity, can quickly generate preference models that require a small amount of memory and achieve good accuracy. Turning now to FIG. 1, raw input features are received by the process (102). Next, polynomial combinations are generated (104). Parameters W, as detailed below, are derived (106), and applied to one or more classifiers (108) which generate outputs (110). The process then determines loss and parameter gradients (120). The process then updates the W parameters used in 106. Next, the process determines sparse regularizing terms (130), and then updates the W parameters used in 106.

The system of FIG. 1 constrains W to be a sparse matrix with many zero entries for the pairs of words which are irrelevant for the preference learning task. If W is a highly sparse matrix, then it consumes much less memory and has only a limited number of free parameters to be estimated. In other words, a sparse W matrix allows the process to greatly scale up the dictionary size to model infrequent words, which are often essential for the preference ordering. In addition, a faster and more scalable learning process is achieved since most entries of W are zeros, so that the corresponding multiplications can be avoided. Another advantage of learning a sparse representation of W is its good interpretability. A zero W_(ij) indicates that the i^(th) query feature and the j^(th) document feature are not correlated for the specific task. A sparse W matrix will accurately capture correlation information between pairs of words, and the learned correlated word pairs can explain the semantic rationale behind the preference ordering.

In order to enforce the sparsity on W, the process imposes an entry-wise l₁ regularization on W. Under certain conditions, l₁ regularization can correctly uncover the underlying ground truth sparsity pattern. Since in many preference learning related applications (e.g. search engines) the model needs to be trained in a timely fashion and new (query, document) pairs may be added to the repository at any time, stochastic gradient descent in the online learning framework is used in one embodiment. To enforce the l₁ regularization, a mini-batch shrinking step is performed every T iterations in the stochastic gradient descent, which can lead to the sparse solution. Moreover, to reduce the additional bias introduced by the l₁ regularization, a refitting step is used to improve the preference prediction while keeping the learned sparsity pattern. Using sparsity, the model can be well controlled and efficiently learned. The result is a stochastic-gradient-descent-based sparse learning method that converges quickly with a small number of training samples. The method remains sparse and thus is memory efficient throughout the whole training process. The method makes polynomial feature models (such as supervised semantic indexing) more effective in the following ways:

1. Consumes less memory.

2. Fast training process.

3. Able to handle more features.

In one embodiment, the system receives a set of documents in the corpus as {d^(k)}_(k=1)^(K)⊂R^(D) and a query as q∈R^(D), where D is the dictionary size, and the j^(th) dimension of a document/query vector indicates the frequency of occurrence of the j^(th) word, e.g. using the tf-idf weighting and then normalizing to unit length. Given a query q and a document d, the system learns a scoring function ƒ(q,d) that measures the relevance of d to q. In one embodiment, ƒ is a linear function which takes the form of Eq. (1). Each entry of W represents a “relationship” between a pair of words.

Given a set of tuples R (labeled data), where each tuple contains a query q, a preferred document d⁺ and an unpreferred (or lower ranked) document d⁻, the system learns a W such that q^(T)Wd⁺>q^(T)Wd⁻, making the right preference prediction.

For that purpose, given a tuple (q,d⁺,d⁻), a margin rank loss technique can be used:
$\begin{matrix}{L_{W}(q,d^{+},d^{-}) \equiv h(q^{T}Wd^{+} - q^{T}Wd^{-}) = \max(0,\; 1 - q^{T}Wd^{+} + q^{T}Wd^{-}),} & (2)\end{matrix}$
where h(x)≡max(0,1−x) is the hinge loss function as adopted in SVM.
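The scoring function of Eq. (1) and the margin rank loss of Eq. (2) can be illustrated on a toy example. The sketch below assumes NumPy, a dictionary of size D=3, and an identity W chosen purely for illustration; the function names are hypothetical.

```python
import numpy as np


def score(q, W, d):
    """Relevance score f(q, d) = q^T W d, as in Eq. (1)."""
    return float(q @ W @ d)


def margin_rank_loss(q, W, d_pos, d_neg):
    """Hinge loss of Eq. (2): max(0, 1 - q^T W d+ + q^T W d-)."""
    return max(0.0, 1.0 - score(q, W, d_pos) + score(q, W, d_neg))


# Toy data: the query shares its only word with d_pos and none with d_neg.
q = np.array([1.0, 0.0, 0.0])
d_pos = np.array([1.0, 0.0, 0.0])
d_neg = np.array([0.0, 1.0, 0.0])
W = np.eye(3)

loss_correct = margin_rank_loss(q, W, d_pos, d_neg)  # margin of 1 is met
loss_swapped = margin_rank_loss(q, W, d_neg, d_pos)  # ranking is violated
```

The loss is zero only when the preferred document outscores the other one by a margin of at least 1, so a merely correct ordering with a small margin is still penalized.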

The system learns the W matrix which minimizes the loss in Eq. (2) by summing over all tuples (q,d⁺,d⁻) in R:

$\begin{matrix}{W^{*} = \underset{W}{\arg\;\min}\;\frac{1}{|R|}\sum\limits_{(q,d^{+},d^{-}) \in R} L_{W}(q,d^{+},d^{-}).} & (3)\end{matrix}$

In general, the size of R is large and new tuples may be added to R in a streaming manner, which makes it difficult to directly train the objective in Eq. (3). To address this issue, stochastic (sub)gradient descent (SGD) is used in an online learning framework. At each iteration, the system randomly draws a sample (q,d⁺,d⁻) from R and computes the subgradient of L_(W)(q,d⁺,d⁻) with respect to W as follows:

$\begin{matrix}{{\nabla\;{L_{W}\left( {q,d^{+},d^{-}} \right)}} = \left\{ {\begin{matrix}{- {q\left( {d^{+} - d^{-}} \right)}^{T}} & {{{if}\mspace{14mu} q^{T}{W\left( {d^{+} - d^{-}} \right)}} < 1} \\0 & {otherwise}\end{matrix},} \right.} & (4)\end{matrix}$
and then updates the W matrix accordingly.

W⁰ is initialized to the identity matrix, as this initializes the model to the same solution as a cosine similarity score. This strategy introduces a prior expressing that the weight matrix should be close to the identity matrix. Term correlations are used when it is necessary to increase the score of a relevant document or, conversely, decrease the score of an irrelevant document. As for the learning rate η_(t), a decaying learning rate is used with

${\eta_{t} = \frac{C}{\sqrt{t}}},$
where C is a pre-defined constant serving as the initial learning rate. Exemplary pseudo code for the Stochastic Subgradient Descent (SGD) process is described below:

Initialization: W⁰ ∈ R^(D×D), learning rate sequence {η_(t)}.
Iterate for t = 1, 2, ... until convergence of W^(t):
  1. Randomly draw a tuple (q,d⁺,d⁻) ∈ R.
  2. Compute the subgradient of L_(W^(t−1))(q,d⁺,d⁻) with respect to W: ∇L_(W^(t−1))(q,d⁺,d⁻).
  3. Update W^(t) = W^(t−1) − η_(t)∇L_(W^(t−1))(q,d⁺,d⁻).
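The pseudo code above can be sketched in Python as follows. This is an illustrative sketch rather than the embodiment's implementation: it assumes NumPy, dense vectors, and a fixed iteration budget (`n_iters`) standing in for the convergence test; the names `hinge_subgradient`, `sgd`, `c` and `seed` are hypothetical.

```python
import numpy as np


def hinge_subgradient(W, q, d_pos, d_neg):
    """Eq. (4): -q (d+ - d-)^T if the margin is violated, else the zero matrix."""
    diff = d_pos - d_neg
    if q @ W @ diff < 1.0:
        return -np.outer(q, diff)
    return np.zeros_like(W)


def sgd(tuples, dim, c=0.1, n_iters=1000, seed=0):
    """Stochastic subgradient descent on the margin rank loss."""
    rng = np.random.default_rng(seed)
    W = np.eye(dim)                       # W^0 = identity (cosine-like start)
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = tuples[rng.integers(len(tuples))]
        eta = c / np.sqrt(t)              # decaying rate eta_t = C / sqrt(t)
        W -= eta * hinge_subgradient(W, q, d_pos, d_neg)
    return W


# Toy tuple: the query word (index 0) never occurs in either document, so the
# identity start scores both at 0 and only off-diagonal weights can rank them.
e = np.eye(3)
W = sgd([(e[0], e[1], e[2])], dim=3)
margin = e[0] @ W @ (e[1] - e[2])
```

On this toy tuple the update raises W[0, 1] and lowers W[0, 2] until the margin reaches 1, after which the subgradient is zero and W stops changing.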

In one embodiment, an explicit form of the subgradient ∇L_(W)(q,d⁺,d⁻) is:

$\begin{matrix}{{\nabla\;{L_{W}\left( {q,d^{+},d^{-}} \right)}} = \left\{ {\begin{matrix}{- {q\left( {d^{+} - d^{-}} \right)}^{T}} & {{{if}\mspace{14mu} q^{T}{W\left( {d^{+} - d^{-}} \right)}} < 1} \\0 & {otherwise}\end{matrix}.} \right.} & (5)\end{matrix}$

As discussed above, the model and the learning algorithm in the previous section lead to a dense W matrix which consumes a large amount of memory and has poor generalization ability for small corpora. To address these problems, a sparse model with a small number of nonzero entries of W is learned. In order to obtain a sparse W, the entry-wise l₁ norm of W is added as a regularization term to the loss function, and the following objective function is optimized:

$\begin{matrix}{W^{*} = \underset{W}{\arg\;\min}\;\frac{1}{|R|}\sum\limits_{(q,d^{+},d^{-}) \in R} L_{W}(q,d^{+},d^{-}) + \lambda\|W\|_{1},} & (6)\end{matrix}$
where

$\|W\|_{1} = \sum\limits_{i,j=1}^{D} |W_{ij}|$
is the entry-wise l₁ norm of W and λ is the regularization parameter which controls the sparsity level (the number of nonzero entries) of W. In general, a larger λ leads to a sparser W. On the other hand, too sparse a W will miss some useful relationship information among word pairs (consider a diagonal W as an extreme case). Therefore, in practice, λ is tuned to obtain a W with a suitable sparsity level.

Next, the training of the Sparse Model is discussed. To optimize Eq. (6), a variant of the general sparse online learning scheme is applied. After updating W^(t) at each iteration in SGD, a shrinkage step is performed by solving the following optimization problem:

$\begin{matrix}{{\hat{W}}^{t} = \underset{W}{\arg\;\min}\;\frac{1}{2}\|W - W^{t}\|_{F}^{2} + \lambda\,\eta_{t}\|W\|_{1},} & (7)\end{matrix}$
and then uses Ŵ^(t) as the starting point for the next iteration. In Eq. (7), ∥·∥_(F) denotes the matrix Frobenius norm and η_(t) is the decaying learning rate for the t^(th) iteration. Performing (7) will shrink those W_(ij)^(t) with an absolute value less than λη_(t) to zero and hence lead to a sparse W matrix.

Although performing the shrinkage step leads to a sparse W solution, it becomes expensive for a large dictionary size D. For example, when D=10,000, D²=10⁸ operations are needed. In one embodiment, shrinkage is therefore performed only once every T iterations. In general, a smaller T guarantees that the shrinkage step is done in a timely fashion, so that the entries of W will not grow so large as to produce an inaccurate ∇L_(W)(q,d⁺,d⁻); on the other hand, a smaller T increases the computational cost of the training process. In one embodiment, T=100. When t is a multiple of T, the shrinkage step with a cumulative regularization parameter is used, solving the following optimization problem:

$W^{T} = \underset{W}{\arg\;\min}\;\frac{1}{2}\|W - W_{0}\|_{F}^{2} + \lambda\sum\limits_{t=1}^{T}\eta_{t}\|W\|_{1}.$
The sparse W learned this way is a good approximation of the one obtained by taking the shrinkage step at every iteration. Pseudo-code for the Sparse SGD process is:

Sparse SGD
Initialization: W⁰ ∈ R^(D×D), T, learning rate sequence {η_(t)}.
Iterate for t = 1, 2, . . . until convergence of W^(t):
  1. Randomly draw a tuple (q, d⁺, d⁻) ∈ R.
  2. Compute the subgradient of L_(W^(t−1))(q, d⁺, d⁻) with respect to W: ∇L_(W^(t−1))(q, d⁺, d⁻).
  3. Update W^(t) = W^(t−1) − η_(t)∇L_(W^(t−1))(q, d⁺, d⁻).
  4. If (t mod T = 0), perform the shrinkage step:
  $W^{T} = \underset{W}{\arg\;\min}\;\frac{1}{2}\|W - W_{0}\|_{F}^{2} + \lambda\sum\limits_{t=1}^{T}\eta_{t}\|W\|_{1}.$
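The Sparse SGD procedure can be sketched as below. The sketch relies on the standard fact that a problem of the form argmin ½∥V−W∥_F² + τ∥V∥₁ is solved entry-wise by soft-thresholding; it assumes NumPy, a fixed iteration budget in place of a convergence test, and hypothetical names (`soft_threshold`, `sparse_sgd`, `shrink_every` for T).

```python
import numpy as np


def soft_threshold(W, tau):
    """Entry-wise minimizer of 0.5*||V - W||_F^2 + tau*||V||_1 over V."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)


def sparse_sgd(tuples, dim, lam=0.01, c=0.1, shrink_every=100,
               n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = np.eye(dim)
    eta_sum = 0.0                          # learning rates since last shrinkage
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = tuples[rng.integers(len(tuples))]
        eta = c / np.sqrt(t)
        diff = d_pos - d_neg
        if q @ W @ diff < 1.0:             # margin violated: subgradient step
            W += eta * np.outer(q, diff)
        eta_sum += eta
        if t % shrink_every == 0:          # cumulative shrinkage every T steps
            W = soft_threshold(W, lam * eta_sum)
            eta_sum = 0.0
    return W


demo = soft_threshold(np.array([[0.5, -0.05], [0.2, -0.3]]), 0.1)
e = np.eye(3)
W = sparse_sgd([(e[0], e[1], e[2])], dim=3)
```

Entries whose magnitude falls below the accumulated threshold are set exactly to zero, and the remaining entries move toward zero by the same amount, which is the bias the refitting step later removes.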

Next, refitting the Sparse Model is discussed. From Eq. (8), l₁ regularization not only shrinks the weights for uncorrelated word pairs to zero but also reduces the absolute value of the weights for correlated word pairs. This additional bias introduced by l₁ regularization often harms the prediction performance. In order to reduce this bias, the system refits the model without l₁ regularization, while enforcing the sparsity pattern of W, by minimizing the following objective function:

$\begin{matrix}{W^{*} = \underset{W}{\arg\;\min}\;\frac{1}{|R|}\sum\limits_{(q,d^{+},d^{-}) \in R} L_{P_{\Omega}(W)}(q,d^{+},d^{-}),} & (9)\end{matrix}$
Here Ω is taken to denote the sparsity pattern of the learned sparse W, i.e. the index set of its nonzero entries, and P_Ω(W) keeps the entries of W indexed by Ω and sets all other entries to zero. SGD is used to minimize Eq. (9), but ∇L_(W)(q,d⁺,d⁻) is replaced with ∇L_(P_Ω(W))(q,d⁺,d⁻). Using the chain rule for subgradients, ∇L_(P_Ω(W))(q,d⁺,d⁻) takes the following form:

${\nabla\;{L_{P_{\Omega}{(W)}}\left( {q,d^{+},d^{-}} \right)}} = \left\{ {\begin{matrix}{- {P_{\Omega}\left( {q\left( {d^{+} - d^{-}} \right)}^{T} \right)}} & {{{if}\mspace{14mu} q^{T}{P_{\Omega}(W)}\left( {d^{+} - d^{-}} \right)} < 1} \\0 & {otherwise}\end{matrix}.} \right.$
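A single refitting update can be sketched as follows. The sketch assumes that Ω is the set of nonzero positions of the previously learned sparse W, represented here by a boolean array `mask`; the function name `refit_step` and the toy values are hypothetical.

```python
import numpy as np


def refit_step(W, mask, q, d_pos, d_neg, eta):
    """One SGD step on Eq. (9), restricted to the support marked by `mask`."""
    W_proj = np.where(mask, W, 0.0)                # P_Omega(W)
    diff = d_pos - d_neg
    if q @ W_proj @ diff < 1.0:                    # margin violated
        grad = -np.where(mask, np.outer(q, diff), 0.0)
        W_proj = W_proj - eta * grad               # only entries in Omega move
    return W_proj


# Toy example: the learned support contains the single entry (0, 1).
W_sparse = np.zeros((3, 3))
W_sparse[0, 1] = 0.2
mask = W_sparse != 0
e = np.eye(3)
W_refit = refit_step(W_sparse, mask, e[0], e[1], e[2], eta=0.5)
```

Because the gradient is masked, the refit can enlarge the surviving weights without ever reintroducing entries that the l₁ step removed.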

In Eq. (6), the system enforces the sparsity and regularizes each W_(ij) with the same regularization parameter λ, which does not incorporate prior knowledge about the importance of the word pair (i, j) for the specific preference ordering. For example, the most frequent pairs of words, e.g. (a, the), (in, on), do not convey much information, so the regularization parameter for those pairs should be large to give W_(ij) more tendency to become zero. On the other hand, some less frequent pairs of words, e.g. (Israel, Jerusalem), are essential for the preference ordering, and the corresponding regularization parameters should be small. Therefore, instead of using the same λ to regularize all |W_(ij)|, the system assigns a different regularization parameter to each |W_(ij)| and learns the sparse W via the following objective function:

$\begin{matrix}{{{\frac{1}{R}{\sum\limits_{{({q,d^{+},d^{-}})} \in R}{L_{W}\left( {q,d^{+},d^{-}} \right)}}} + {\lambda{\sum\limits_{i,{j = 1}}^{D}{\gamma_{ij}{W}_{ij}}}}},} & (10)\end{matrix}$where the hyperparameter γ_(ij) measures the importance of word pair (i,j) for the preference learning task and it should be large for thoseless important word pair. Based on the above discussion, in this paper,is set to be proportional to ƒ_(i)·ƒ_(j), where ƒ_(i), ƒ_(j) are thedocument frequencies for i^(th) and j^(th) words. We normalize γ_(ij)such that max_(i,j)γ_(ij)=1. Eq. (10) can be minimized as:

${\hat{W}}^{t} = \underset{W}{\arg\;\min}\;\frac{1}{2}\|W - W^{t}\|_{F}^{2} + \lambda\sum\limits_{k=t-T+1}^{t}\eta_{k}\sum\limits_{i,j=1}^{D}\gamma_{ij}|W_{ij}|.$
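The pair-specific weights and the resulting shrinkage can be sketched as below. This is an illustration under stated assumptions: NumPy, γ_(ij) computed as the normalized product of document frequencies, and an entry-wise soft-threshold whose threshold is scaled by γ_(ij); the names `pair_weights` and `weighted_soft_threshold` and the toy frequencies are hypothetical.

```python
import numpy as np


def pair_weights(doc_freq):
    """gamma_ij proportional to f_i * f_j, normalized so that max gamma_ij = 1."""
    gamma = np.outer(doc_freq, doc_freq).astype(float)
    return gamma / gamma.max()


def weighted_soft_threshold(W, gamma, tau):
    """Entry-wise minimizer of 0.5*||V - W||_F^2 + tau * sum_ij gamma_ij |V_ij|."""
    return np.sign(W) * np.maximum(np.abs(W) - tau * gamma, 0.0)


# Word 0 is frequent (e.g. a stop word), word 1 is rare.
gamma = pair_weights(np.array([10, 1]))
W = np.full((2, 2), 0.5)
W_shrunk = weighted_soft_threshold(W, gamma, tau=0.6)
```

With equal starting weights, the entry for the frequent-frequent pair is removed while the rare-rare entry is barely changed, which mirrors the motivation given for Eq. (10).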

The prediction performance improves after the refitting step. In contrast to traditional preference learning or “learning to rank” with a few hundred hand-crafted features, the basic model directly performs learning on actual words and considers their pairwise relationships between query and document. Although the pairwise relationships of words can provide additional semantic information and improve performance, the basic model suffers from storage overloads and parameter overfitting. To overcome these drawbacks, sparsity is applied to the model, which is achieved by the l₁ regularization with an efficient online learning algorithm. The method achieves good performance with fast convergence, while remaining sparse during training (small memory consumption), on a model with hundreds of millions of parameters. The inventors contemplate that extended models can learn group structure or even hierarchical structure of the words using group lasso type regularization. Prior knowledge on the structure of W can also be imposed.

2) Sparse Latent Semantic Analysis

As discussed above, LSA, one of the most successful tools for learning concepts or latent topics from text, has been widely used for dimension reduction in information retrieval. Given a document-term matrix X∈R^(N×M), where N is the number of documents and M is the number of words, and with the number of latent topics (the dimensionality of the latent space) set to D (D≤min{N,M}), LSA applies singular value decomposition (SVD) to construct a low rank (rank-D) approximation of X: X≈USV^(T), where the column orthogonal matrices U∈R^(N×D) (U^(T)U=I) and V∈R^(M×D) (V^(T)V=I) represent document and word embeddings into the latent space. S is a diagonal matrix with the D largest singular values of X on the diagonal. Subsequently, the so-called projection matrix defined as A=S⁻¹V^(T) provides a transformation mapping of documents from the word space to the latent topic space, which is less noisy and considers word synonymy (i.e. different words describing the same idea). However, in LSA, each latent topic is represented by all word features, which sometimes makes it difficult to precisely characterize the topic-word relationships.
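The classical LSA construction described above can be sketched with a truncated SVD. The sketch assumes NumPy and a small random matrix standing in for a real document-term matrix; the function name `lsa_projection` is hypothetical.

```python
import numpy as np


def lsa_projection(X, n_topics):
    """Rank-D truncated SVD X ~ U S V^T and projection matrix A = S^{-1} V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :n_topics], s[:n_topics], Vt[:n_topics, :]
    A = Vt / s[:, None]                    # row d of V^T scaled by 1 / s_d
    return U, s, A


rng = np.random.default_rng(0)
X = rng.random((6, 5))                     # toy matrix: N = 6 docs, M = 5 words
U, s, A = lsa_projection(X, n_topics=2)
X_latent = X @ A.T                         # documents mapped to topic space
```

Projecting the training documents with A recovers their embeddings U, and every row of A is generally dense, which is the interpretability problem that Sparse LSA addresses.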

FIG. 2 shows an exemplary learning method named “sparse latent semantic analysis” (SLSA) that adds sparsity constraints on the projection matrix inside LSA. The Sparse LSA approach reformulates LSA as an optimization problem which minimizes the approximation error under the orthogonality constraint of U. Based on this formulation, the sparsity constraint on the projection matrix A is added via l₁ regularization as in the lasso model. By enforcing sparsity on A, the model has the ability to automatically select the most relevant words for each latent topic. SLSA provides a more compact representation of the topic-word relationship, and is memory efficient throughout the whole training process.

Referring now to FIG. 2, raw documents are received as input (202). The process reconstructs X by min ∥X−UA∥ to get U and A (204). Reconstruction errors are determined (206). Next, the process, with a fixed projection matrix A, determines loss and parameter gradients to obtain U (208). Next, the process fixes U and determines loss and sparse regularization constraints on the projection matrix A. The process also generates the parameter coefficients to get the updated projection matrix A (210) and sends the updated matrix A to operation 204.

A notable advantage of Sparse LSA is its good interpretability of the topic-word relationship. Sparse LSA automatically selects the most relevant words for each latent topic and hence provides a clear and compact representation of the topic-word relationship. Moreover, for a new document q, if the words in q have no intersection with the relevant words of the d-th topic (the nonzero entries in A^(d), the d-th row of A), the d-th element of q̂, A^(d)q, will be zero. In other words, the sparse latent representation q̂ clearly indicates the topics that q belongs to.

Another benefit of learning a sparse A is the saving in computational cost and storage requirements when D is large. In traditional LSA, the topics with larger singular values cover a broader range of concepts than the ones with smaller singular values. For example, the first few topics with the largest singular values are often too general to have specific meanings. As singular values decrease, the topics become more and more specific. Therefore, it may be desirable to enlarge the number of latent topics D to have a reasonable coverage of the topics. However, given a large corpus with millions of documents, a larger D will greatly increase the computational cost of projection operations in traditional LSA. In contrast, for Sparse LSA, projecting documents via a highly sparse projection matrix is computationally much more efficient, and it takes much less memory to store A when D is large.

In order to obtain a sparse A, inspired by the lasso model, an entry-wise l₁-norm of A is added as the regularization term to the loss function, and the Sparse LSA model is formulated as:
$\begin{matrix}{\min\limits_{U,A}\;\frac{1}{2}\|X - UA\|_{F}^{2} + \lambda\|A\|_{1}} & (14)\end{matrix}$
subject to: U^(T)U=I,
where

$\|A\|_{1} = \sum\limits_{d=1}^{D}\sum\limits_{j=1}^{M} |a_{dj}|$
is the entry-wise l₁-norm of A and λ is the positive regularization parameter which controls the density (the number of nonzero entries) of A. In general, a larger λ leads to a sparser A. On the other hand, an A that is too sparse will miss some useful topic-word relationships, which harms the reconstruction performance. Therefore, in practice, λ should be chosen as large as possible while still achieving good reconstruction performance.

Although the optimization problem is non-convex, when one variable (either U or A) is fixed, the objective function with respect to the other is convex. Therefore, one approach to solving Eq. (14) is to alternate between the following two steps:

1. When U is fixed, let A_(j) denote the j-th column of A; the optimization problem with respect to A:

$\min\limits_{A}\;\frac{1}{2}\|X - UA\|_{F}^{2} + \lambda\|A\|_{1},$
can be decomposed into M independent ones:

$\begin{matrix}{\min\limits_{A_{j}}\;\frac{1}{2}\|X_{j} - UA_{j}\|_{2}^{2} + \lambda\|A_{j}\|_{1};\quad j = 1,\ldots,M.} & (15)\end{matrix}$
Each subproblem is a standard lasso problem where X_(j) can be viewed as the response and U as the design matrix. To solve Eq. (15), a lasso solver can be applied, which is essentially a coordinate descent approach.

2. When A is fixed, the optimization problem is equivalent to:
$\begin{matrix}{\min\limits_{U}\;\frac{1}{2}\|X - UA\|_{F}^{2}} & (16)\end{matrix}$
subject to: U^(T)U=I.

The objective function in Eq. (16) can be further written as:

$\begin{matrix}{{\frac{1}{2}{{X - {UA}}}_{F}^{2}} = {\frac{1}{2}{{tr}\left( {\left( {X - {UA}} \right)^{T}\left( {X - {UA}} \right)} \right)}}} \\{= {{- {{tr}\left( {A^{T}U^{T}X} \right)}} + {\frac{1}{2}{{tr}\left( {X^{T}X} \right)}} + {\frac{1}{2}{{tr}\left( {A^{T}U^{T}{UA}} \right)}}}} \\{{= {{- {{tr}\left( {A^{T}U^{T}X} \right)}} + {\frac{1}{2}{{tr}\left( {X^{T}X} \right)}} + {\frac{1}{2}{{tr}\left( {A^{T}A} \right)}}}},}\end{matrix}$where the last equality is according to the constraint that U^(T)U=I. Bythe fact that tr(A^(T)U^(T)X)≡tr(U^(T)XA^(T)), the optimization problemin Eq. (16) is equivalent tomax_(U) tr(U ^(T) XA ^(T))  (17)Subject to U ^(T) U=I.Let V=XA^(T). In fact, V is the latent topic representations of thedocuments X. Assuming that V is full column rank, i.e. with rank(V)=D,Eq. (17) has the closed form solution when the singular valuedecomposition (SVD) of V is V=PΔQ, the optimal solution to Eq. (17) isU=PQ.
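The closed-form U update can be sketched as follows. The sketch assumes NumPy, whose `numpy.linalg.svd` returns the right factor already transposed, matching the V=PΔQ convention used above; the function name `update_u` and the random toy matrices are hypothetical.

```python
import numpy as np


def update_u(X, A):
    """Maximize tr(U^T X A^T) subject to U^T U = I via the SVD of V = X A^T."""
    V = X @ A.T                                        # shape (N, D)
    P, _, Q = np.linalg.svd(V, full_matrices=False)    # V = P diag(delta) Q
    return P @ Q


rng = np.random.default_rng(0)
X = rng.random((6, 5))                   # toy document-term matrix
A = rng.random((2, 5))                   # toy projection matrix, D = 2
U = update_u(X, A)

# Compare with another feasible point, the initialization U^0 = (I_D; 0).
U0 = np.eye(6)[:, :2]
objective_opt = np.trace(U.T @ X @ A.T)
objective_init = np.trace(U0.T @ X @ A.T)
```

At the optimum the objective equals the sum of the singular values of V, so no other matrix with orthonormal columns can score higher.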

As for the starting point, any A⁰, or any U⁰ satisfying (U⁰)^(T)U⁰=I, can be adopted. One embodiment uses a simple initialization strategy for U⁰ as follows:

$\begin{matrix}{{U^{0} = \begin{pmatrix}I_{D} \\0\end{pmatrix}},} & (18)\end{matrix}$
where I_(D) is the D by D identity matrix. It is easy to verify that (U⁰)^(T)U⁰=I.

The optimization procedure can be summarized as pseudo code below:

Optimization Algorithm for Sparse LSA
Input: X, the dimensionality of the latent space D, regularization parameter λ.
Initialization: $U^{0} = \begin{pmatrix}I_{D} \\0\end{pmatrix}$.
Iterate until convergence of U and A:
  1. Compute A by solving M lasso problems as in Eq. (15).
  2. Project X onto the latent space: V = XA^(T).
  3. Compute the SVD of V: V = PΔQ and let U = PQ.
Output: Sparse projection matrix A.

As for the stopping criteria, let ∥·∥_(∞) denote the matrix entry-wise l_(∞)-norm. For two consecutive iterations t and t+1, the process computes the maximum change over all entries in U and A, ∥U^((t+1))−U^((t))∥_(∞) and ∥A^((t+1))−A^((t))∥_(∞), and stops the optimization procedure when both quantities are less than a prefixed constant τ. In one implementation, τ=0.01.
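The whole alternating procedure, including the stopping rule, can be sketched as below. One substitution is made and should be noted: instead of calling a coordinate-descent lasso solver for Eq. (15), the sketch uses the fact that when U^(T)U=I each lasso subproblem is solved exactly by soft-thresholding U^(T)X. NumPy is assumed, and the names `sparse_lsa`, `tol` and `max_iters` are hypothetical.

```python
import numpy as np


def soft_threshold(Z, tau):
    """Entry-wise soft-thresholding operator."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)


def sparse_lsa(X, n_topics, lam, tol=0.01, max_iters=100):
    n_docs, n_words = X.shape
    U = np.eye(n_docs)[:, :n_topics]           # U^0 = (I_D; 0), Eq. (18)
    A = np.zeros((n_topics, n_words))
    for _ in range(max_iters):
        # Step 1: lasso for every column of A; exact because U^T U = I.
        A_new = soft_threshold(U.T @ X, lam)
        # Steps 2-3: V = X A^T, SVD V = P diag(delta) Q, then U = P Q.
        P, _, Q = np.linalg.svd(X @ A_new.T, full_matrices=False)
        U_new = P @ Q
        # Stopping rule: largest entry-wise change in U and in A below tol.
        change = max(np.abs(U_new - U).max(), np.abs(A_new - A).max())
        U, A = U_new, A_new
        if change < tol:
            break
    return U, A


rng = np.random.default_rng(0)
X = rng.random((8, 6))                         # toy document-term matrix
U, A = sparse_lsa(X, n_topics=3, lam=0.1)
_, A_empty = sparse_lsa(X, n_topics=3, lam=100.0)
```

Raising λ zeroes more entries of A; with a very large λ every entry is removed, which illustrates the trade-off between sparsity and reconstruction quality discussed for Eq. (14).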

Next, two extensions of Sparse LSA are discussed:

1. Group Structured Sparse LSA, in which a group structured sparsity-inducing penalty is added to select the groups of features most relevant to each latent topic.

2. Non-negative Sparse LSA, which enforces a non-negativity constraint on the projection matrix A to provide a pseudo probability distribution of each word given the topic, similar to Latent Dirichlet Allocation (LDA).

Group Structured Sparse LSA

Although entry-wise l₁-norm regularization leads to a sparse projection matrix A, it does not take advantage of any prior knowledge on the structure of the input features (e.g. words). When the features are naturally clustered into groups, it is more meaningful to enforce the sparsity pattern at the group level instead of on each individual feature, so that the process can learn which groups of features are relevant to a latent topic. This has many potential applications in analyzing biological data. For example, in latent gene function identification, it is more meaningful to determine which pathways (groups of genes with similar function or nearby locations) are relevant to a latent gene function (topic).

Similar to the group lasso approach, the process can encode the group structure via an l₁/l₂ mixed norm regularization of A in Sparse LSA. Formally, the set of groups of input features G={g₁, . . . , g_(|G|)} is defined as a subset of the power set of {1, . . . , M}, and is available as prior knowledge. For simplicity, the groups are assumed to be non-overlapping. The group structured Sparse LSA can be formulated as:

$\begin{matrix}{\min\limits_{U,A}\;\frac{1}{2}\|X - UA\|_{F}^{2} + \lambda\sum\limits_{d=1}^{D}\sum\limits_{g \in G} w_{g}\|A_{dg}\|_{2}} & (19)\end{matrix}$
subject to: U^(T)U=I,

where A_(dg)∈R^(|g|) is the subvector of A for the latent dimension d and the input features in group g; w_(g) is the predefined regularization weight for each group g; λ is the global regularization parameter; and ∥·∥₂ is the vector l₂-norm, which forces all the features in group g for the d-th latent topic, A_(dg), to become zero simultaneously. A simple strategy for setting w_(g) is w_(g)=√|g|, so that the amount of penalization is adjusted by the size of each group.

To solve Eq. (19), when U is fixed, the optimization problem becomes:

$\begin{matrix}{\min_{A}f(A) \equiv \frac{1}{2}\left\| X - UA \right\|_{F}^{2} + \lambda\sum\limits_{d = 1}^{D}\sum\limits_{g \in G}w_{g}\left\| A_{dg} \right\|_{2}.} & (20)\end{matrix}$ To solve Eq. (20), an efficient block coordinate descent technique is used: at each iteration, the objective function is minimized with respect to A_(dg) while the other entries in A are held fixed.

More precisely, fix a particular latent dimension d and a group g; the process optimizes ƒ(A) with respect to A_(dg). Denote the i-th row of U as U^(i) and the first part of ƒ(A) as g(A)≡½∥X−UA∥_(F) ². The gradient of g(A) over A_(dg) is a |g|-dimensional vector whose element for feature j∈g takes the following form:

$\left( \frac{\partial{g(A)}}{\partial A_{dg}} \right)_{j \in g} = {\sum\limits_{i = 1}^{N}{{u_{id}\left( {{\left( U^{i} \right)^{T}A_{j}} - x_{ij}} \right)}.}}$To convert

$\frac{\partial{g(A)}}{\partial A_{dg}}$ into vector form, let

$C_{d} = {\sum\limits_{i = 1}^{N}u_{id}^{2}}$ and let B_(dg) be the vector of length |g| such that:

$\begin{matrix}{\left( B_{dg} \right)_{j \in g} = {\sum\limits_{i = 1}^{N}{{u_{id}\left( {x_{ij} - {\sum\limits_{k \neq d}{u_{ik}a_{kj}}}} \right)}.}}} & (11)\end{matrix}$The vector form of

$\frac{\partial{g(A)}}{\partial A_{dg}}$can be written as:

$\frac{\partial{g(A)}}{\partial A_{dg}} = {{C_{d}A_{dg}} - {B_{dg}.}}$

The minimization of ƒ(A) with respect to A_(dg) has a closed-form solution. Take the subgradient of ƒ(A) over A_(dg):

$\begin{matrix}{\frac{\partial f(A)}{\partial A_{dg}} = \frac{\partial g(A)}{\partial A_{dg}} + \lambda w_{g}\frac{\partial\left\| A_{dg} \right\|_{2}}{\partial A_{dg}} = C_{d}A_{dg} - B_{dg} + \lambda w_{g}\frac{\partial\left\| A_{dg} \right\|_{2}}{\partial A_{dg}},} & (12)\end{matrix}$ where

$\begin{matrix}{\frac{\partial\left\| A_{dg} \right\|_{2}}{\partial A_{dg}} = \left\{ \begin{matrix}\frac{A_{dg}}{\left\| A_{dg} \right\|_{2}} & {A_{dg} \neq 0} \\ \left\{ {\alpha \in R^{|g|}} \middle| {\left\| \alpha \right\|_{2} \leq 1} \right\} & {A_{dg} = 0}\end{matrix} \right.} & (13)\end{matrix}$

The closed-form solution A_(dg)* is given in the following proposition.

The optimal A_(dg)* for the minimization of ƒ(A) with respect to A_(dg) takes the following form:

$\begin{matrix}{A_{dg}^{*} = \left\{ \begin{matrix}\frac{B_{dg}\left( \left\| B_{dg} \right\|_{2} - \lambda w_{g} \right)}{C_{d}\left\| B_{dg} \right\|_{2}} & {\left\| B_{dg} \right\|_{2} > \lambda w_{g}} \\ 0 & {\left\| B_{dg} \right\|_{2} \leq \lambda w_{g}}\end{matrix} \right. .} & (14)\end{matrix}$
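The form of Eq. (14) can be checked with a short derivation, sketched here under the assumption that C_(d)>0 (which holds whenever the d-th column of U is nonzero). For A_(dg)≠0, setting the subgradient in Eq. (12) to zero gives:

$\left( C_{d} + \frac{\lambda w_{g}}{\left\| A_{dg} \right\|_{2}} \right)A_{dg} = B_{dg},$

so A_(dg) is a positive multiple of B_(dg). Taking the l₂-norm of both sides:

$C_{d}\left\| A_{dg} \right\|_{2} + \lambda w_{g} = \left\| B_{dg} \right\|_{2}\quad\Rightarrow\quad\left\| A_{dg} \right\|_{2} = \frac{\left\| B_{dg} \right\|_{2} - \lambda w_{g}}{C_{d}},$

which is positive only when ∥B_(dg)∥₂>λw_(g). Substituting this norm back into the first relation yields the first case of Eq. (14). Otherwise A_(dg)=0 is optimal, because the zero-subgradient condition at A_(dg)=0 requires B_(dg)=λw_(g)α for some α with ∥α∥₂≦1, i.e. ∥B_(dg)∥₂≦λw_(g).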

The block coordinate descent for optimizing A is summarized in the following pseudo-code.

Algorithm 2 Optimization for A with group structure
Input: X, U, the dimensionality of the latent space D, the global regularization parameter λ, group structure G, regularization weights of groups {w_(g)}_(g∈G).
while A has not converged do
  for d = 1, 2, . . . , D do
    Compute C_(d) = Σ_(i=1) ^(N) u_(id) ²
    for all g ∈ G do
      Compute B_(dg) according to Eq. (11)
      if ∥B_(dg)∥₂ > λw_(g) then
        $A_{dg}\leftarrow\frac{B_{dg}\left( \left\| B_{dg} \right\|_{2} - \lambda w_{g} \right)}{C_{d}\left\| B_{dg} \right\|_{2}}$
      else
        A_(dg) ← 0
      end if
    end for
  end for
end while
Output: Sparse projection matrix A.
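As an illustration, the update loop of Algorithm 2 can be sketched in plain Python. This is a minimal sketch rather than the implementation of the described system: the function name `update_group_sparse_A`, the list-of-lists matrix layout, the sweep limit `n_sweeps`, and the stopping tolerance `tol` are assumptions introduced here, and every column of U is assumed to be nonzero so that C_(d)>0.

```python
import math


def update_group_sparse_A(X, U, groups, weights, lam, n_sweeps=50, tol=1e-8):
    """Block coordinate descent over A with U held fixed.

    X is N x M and U is N x D (lists of lists). `groups` is a list of
    non-overlapping lists of column indices; `weights` holds one w_g per group.
    Returns the D x M projection matrix A.
    """
    N, M, D = len(X), len(X[0]), len(U[0])
    A = [[0.0] * M for _ in range(D)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for d in range(D):
            C_d = sum(U[i][d] ** 2 for i in range(N))
            for g, w_g in zip(groups, weights):
                # B_dg: correlation of column d of U with the residual
                # that leaves out latent dimension d.
                B = []
                for j in g:
                    b = 0.0
                    for i in range(N):
                        rest = sum(U[i][k] * A[k][j] for k in range(D) if k != d)
                        b += U[i][d] * (X[i][j] - rest)
                    B.append(b)
                norm_B = math.sqrt(sum(b * b for b in B))
                if norm_B > lam * w_g:
                    # Shrink the whole group toward zero by a common factor.
                    scale = (norm_B - lam * w_g) / (C_d * norm_B)
                    new = [scale * b for b in B]
                else:
                    # The whole group is switched off for topic d.
                    new = [0.0] * len(g)
                for j, v in zip(g, new):
                    max_change = max(max_change, abs(v - A[d][j]))
                    A[d][j] = v
        if max_change < tol:  # assumed convergence test
            break
    return A


# Toy data: two documents, four features in two groups, U set to the identity.
X = [[3.0, 4.0, 0.1, 0.0],
     [0.0, 0.0, 0.0, 0.1]]
U = [[1.0, 0.0],
     [0.0, 1.0]]
groups = [[0, 1], [2, 3]]
weights = [math.sqrt(len(g)) for g in groups]  # w_g = sqrt(|g|)
A = update_group_sparse_A(X, U, groups, weights, lam=1.0)
```

In this toy run the first group survives for the first topic (shrunk toward zero), while the weakly expressed second group is zeroed as a block, which is the group-level sparsity pattern the penalty is meant to produce.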

Non-Negative Sparse LSA

It is natural to assume that each word has a non-negative contribution to a specific topic, i.e. the projection matrix A should be non-negative. In such a case, the system can normalize each row of A to sum to 1:

${\hat{a}}_{dj} = {\frac{a_{dj}}{\sum\limits_{j' = 1}^{M}a_{dj'}}.}$ Since a_(dj) measures the relevance of the j-th word, w_(j), to the d-th topic t_(d), from the probability perspective â_(dj) can be viewed as a pseudo probability of the word w_(j) given the topic t_(d), P(w_(j)|t_(d)). Similar to topic modeling in the Bayesian framework such as LDA, the non-negative Sparse LSA can also provide the most relevant/likely words for a specific topic.
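The row normalization and the resulting word ranking can be illustrated with a few lines of Python. This is a sketch only: the helper names `normalize_rows` and `top_words`, the toy vocabulary, and the choice to leave all-zero rows unchanged are assumptions made for the example.

```python
def normalize_rows(A):
    """Divide each row of a non-negative matrix by its row sum.

    All-zero rows are returned unchanged (an assumption; such a topic
    selects no words and has no pseudo probability distribution).
    """
    out = []
    for row in A:
        total = sum(row)
        out.append([a / total for a in row] if total > 0 else list(row))
    return out


def top_words(p_row, vocab, k):
    """Return up to k words with the largest nonzero pseudo probability."""
    order = sorted(range(len(p_row)), key=lambda j: -p_row[j])
    return [vocab[j] for j in order[:k] if p_row[j] > 0]


# Toy example: two topics over a three-word vocabulary.
vocab = ["gene", "cell", "protein"]
A = [[2.0, 0.0, 6.0],
     [0.0, 0.0, 0.0]]
P = normalize_rows(A)
best = top_words(P[0], vocab, 2)
```

Here the first row becomes the pseudo distribution (0.25, 0, 0.75), so "protein" ranks ahead of "gene" for that topic.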

More formally, the non-negative Sparse LSA can be formulated as the following optimization problem:

min_(U,A) ½∥X−UA∥_(F) ² + λ∥A∥₁  (15)

subject to U^(T)U=I, A≧0.

According to the non-negativity constraint of A, |a_(dj)|=a_(dj) and Eq. (15) is equivalent to:

$\begin{matrix}{\min_{U,A}\ \frac{1}{2}\left\| X - UA \right\|_{F}^{2} + \lambda\sum\limits_{d = 1}^{D}\sum\limits_{j = 1}^{M}a_{dj}} & (16)\end{matrix}$ subject to U^(T)U=I, A≧0.

When A is fixed, the optimization with respect to U is the same as that in Eq. (16). When U is fixed, the optimization over A can be decomposed into M independent subproblems, each corresponding to a column of A:

$\begin{matrix}{\min\limits_{A_{j} \geq 0}f\left( A_{j} \right) = \frac{1}{2}\left\| X_{j} - UA_{j} \right\|_{2}^{2} + \lambda\sum\limits_{d = 1}^{D}a_{dj}.} & (17)\end{matrix}$

The coordinate descent method is used to minimize ƒ(A_(j)): fix a dimension of the latent space d; optimize ƒ(A_(j)) with respect to a_(dj) while keeping the other entries of A_(j) fixed; and iterate over d. More precisely, for a fixed d, the gradient of ƒ(A_(j)) with respect to a_(dj) is:

$\begin{matrix}{{\frac{\partial{f\left( A_{j} \right)}}{\partial a_{dj}} = {{c_{d}a_{dj}} - b_{d} + \lambda}},} & (18)\end{matrix}$where

$c_{d} = {\sum\limits_{i = 1}^{N}u_{id}^{2}},$

$b_{d} = {\sum\limits_{i = 1}^{N}{{u_{id}\left( {x_{ij} - {\sum\limits_{k \neq d}{u_{ik}a_{kj}}}} \right)}.}}$When b_(d)>λ, setting

$a_{dj} = \frac{b_{d} - \lambda}{c_{d}}$will make

$\frac{\partial{f\left( A_{j} \right)}}{\partial a_{dj}} = 0.$If b_(d)≦λ,

$\frac{\partial{f\left( A_{j} \right)}}{\partial a_{dj}} \geq 0$ for all a_(dj)≧0, so ƒ(A_(j)) is a monotonically increasing function of a_(dj) on the feasible region and the minimum is achieved at a_(dj)=0.

In sum, the optimal a_(dj)* for the minimization of ƒ(A_(j)) with respect to a_(dj) takes the following form:

$\begin{matrix}{a_{dj}^{*} = \left\{ {\begin{matrix}\frac{b_{d} - \lambda}{c_{d}} & {b_{d} > \lambda} \\0 & {b_{d} \leq \lambda}\end{matrix}.} \right.} & (19)\end{matrix}$
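The per-column coordinate descent with this closed-form update can be sketched as follows. It is a minimal illustration, not the implementation of the described system: the function name `update_nonneg_column`, the sweep limit `n_sweeps`, and the stopping tolerance `tol` are assumptions, and each column of U is assumed to be nonzero so that c_(d)>0.

```python
def update_nonneg_column(x_j, U, lam, n_sweeps=100, tol=1e-10):
    """Coordinate descent for one column A_j of A with U fixed and A_j >= 0.

    x_j is the j-th column of X (length N); U is N x D (list of lists).
    Returns the D non-negative entries a_1j, ..., a_Dj.
    """
    N, D = len(U), len(U[0])
    a = [0.0] * D
    c = [sum(U[i][d] ** 2 for i in range(N)) for d in range(D)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for d in range(D):
            # b_d: correlation of column d of U with the residual that
            # leaves out latent dimension d.
            b_d = sum(
                U[i][d] * (x_j[i] - sum(U[i][k] * a[k] for k in range(D) if k != d))
                for i in range(N)
            )
            # Closed-form update: shifted value if b_d > lambda, otherwise zero.
            new = (b_d - lam) / c[d] if b_d > lam else 0.0
            max_change = max(max_change, abs(new - a[d]))
            a[d] = new
        if max_change < tol:  # assumed convergence test
            break
    return a


# Toy example with U set to the identity, so c_d = 1 and b_d = x_dj.
U = [[1.0, 0.0],
     [0.0, 1.0]]
a = update_nonneg_column([2.0, 0.3], U, lam=0.5)
```

With these inputs the first entry is shifted down by λ to 1.5, and the second entry, whose b_(d) does not exceed λ, is set to zero. Because the M column subproblems are independent, a caller could run this function once per column of X.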

Compared to traditional LSA, Sparse LSA selects only a small number of relevant words for each topic and hence provides a compact representation of topic-word relationships. Moreover, Sparse LSA is computationally very efficient, with significantly reduced memory usage for storing the projection matrix. Empirical results show that Sparse LSA achieves performance gains similar to LSA, but is more efficient in projection computation and storage, and also explains the topic-word relationships well. For example, Sparse LSA is more effective in the following ways:

-   -   1. Provides a clear and compact representation of the topic-word relationship, which makes the model easy to explain.
    -   2. Consumes less memory.
    -   3. Is able to handle more topics.

The Sparse LSA model rests on the intuition that only a part of the vocabulary can be relevant to a certain topic. By enforcing sparsity of A such that each row (representing a latent topic) has only a small number of nonzero entries (representing the most relevant words), Sparse LSA can provide a compact representation of the topic-word relationship that is easier to interpret. Further, by adjusting the sparsity level of the projection matrix, the system can control the granularity (“level-of-details”) of the topics it is trying to discover, e.g. more generic topics have more nonzero entries in rows of A than specific topics. Due to the sparsity of A, Sparse LSA provides an efficient strategy both in the time cost of the projection operation and in the storage cost of the projection matrix when the dimensionality of the latent space D is large. Sparse LSA can project a document q into a sparse vector representation q̂, where each entry of q̂ corresponds to a latent topic. In other words, the topics that q belongs to can be read directly from the positions of the nonzero entries of q̂. Moreover, the sparse representation of projected documents saves considerable computational cost in subsequent retrieval tasks, e.g. ranking (which involves computing cosine similarity) and text categorization, among others.
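The storage and projection savings can be illustrated with a small Python sketch. It assumes that a document is projected as q̂=Aq and that both A and q are stored as index-to-value dictionaries holding only nonzero entries; the helper names `to_sparse_rows` and `project` are hypothetical and introduced only for this example.

```python
def to_sparse_rows(A):
    """Keep only the nonzero entries of each row (topic) of A."""
    return [{j: a for j, a in enumerate(row) if a != 0.0} for row in A]


def project(sparse_rows, q):
    """Project a sparse document q = {word_index: weight} onto the topics.

    Returns {topic_index: value} containing only the nonzero topics, so the
    keys directly list the topics the document belongs to.
    """
    q_hat = {}
    for d, row in enumerate(sparse_rows):
        # Iterate over the smaller support so the cost scales with the
        # number of nonzeros rather than with the vocabulary size.
        small, large = (row, q) if len(row) <= len(q) else (q, row)
        s = sum(v * large[j] for j, v in small.items() if j in large)
        if s != 0.0:
            q_hat[d] = s
    return q_hat


# Toy example: three topics over four words; the document uses words 0 and 3.
A = [[0.5, 0.0, 0.0, 2.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0]]
rows = to_sparse_rows(A)
q_hat = project(rows, {0: 2.0, 3: 1.0})
```

In this example only topic 0 shares words with the document, so the projected representation has a single nonzero entry.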

Sparse LSA provides a more compact and precise projection by selecting only a small number of relevant words for each latent topic. Extensions of Sparse LSA, namely group structured Sparse LSA and non-negative Sparse LSA, along with a simple yet efficient system to learn the sparse projection matrix for these models, are presented. The inventors contemplate utilizing the online learning scheme to learn from web-scale datasets.

The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.

By way of example, a block diagram of a computer to support the system is discussed next in FIG. 3. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).

Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.

What is claimed is:
1. A method to perform preference learning on a set of documents, comprising: receiving raw input features from the set of documents stored on a data storage device; generating polynomial combinations from the raw input features; generating one or more parameters W; applying W to one or more classifiers to generate outputs; determining a loss function and parameter gradients and updating W; determining one or more sparse regularizing terms and updating W; and expressing that one document is preferred over another in a search query and retrieving one or more documents responsive to the search query, wherein said regularizing terms are responsive to an imposing of an entry-wise l₁ regularization on W and a refitting without the regularization is used to improve preference prediction while keeping the learned sparsity and said refitting reducing additional bias introduced by the l₁ regularization, said refitting being affected by ${P_{\Omega}(W)}_{ij} = \left\{ \begin{matrix}{{W_{ij}\mspace{14mu}{if}\mspace{14mu}\left( {i,j} \right)} \in \Omega} \\{{0\mspace{14mu}{if}\mspace{14mu}\left( {i,j} \right)\mspace{14mu}{not}} \in \Omega}\end{matrix} \right.$ where W represents a relationship between a pair of words, Ω represents indices of non-zero entries of sparse W, and ij are matrix iterations.
2. The method of claim 1, comprising constraining W to be a sparse matrix zero entry for pairs of words irrelevant to the preference learning.
3. The method of claim 1, comprising enforcing an l₁ regularization with mini-batch shrinking for every predetermined iterations in a stochastic gradient descent.
4. The method of claim 1, comprising performing a stochastic (sub)gradient descent (SGD) with an online learning framework.
5. The method of claim 1, comprising applying a learning rate η_(t), a decaying learning rate is used with η_(t)=C/√t, where C is a pre-defined constant as an initial learning rate and t represents an iteration.
6. The method of claim 1, comprising shrinking W_(ij) ^(t) with an absolute value less than λη_(t) to zero and generating a sparse W matrix, λ is a regularization parameter which controls a sparsity level (number of nonzero entries) of W and η_(t) denotes a learning rate.
7. The method of claim 1, comprising performing shrinkage at every iteration to generate a sparse W.
8. A computer to perform preference learning on a set of documents, comprising: means for receiving raw input features from the set of documents stored on a data storage device; means for generating polynomial combinations from the raw input features; means for generating one or more parameters W; means for applying W to one or more classifiers to generate outputs; means for determining a loss function and parameter gradients and updating W; means for determining one or more sparse regularizing terms and updating W; and means for expressing that one document is preferred over another in a search query and retrieving one or more documents responsive to the search query, wherein said regularizing terms are responsive to an imposing of an entry-wise l₁ regularization on W and to reduce additional bias introduced by the l₁ regularization a refitting without the regularization is used to improve preference prediction while keeping the learned sparsity, said refitting being related to ${P_{\Omega}(W)}_{ij} = \left\{ \begin{matrix}{{W_{ij}\mspace{14mu}{if}\mspace{14mu}\left( {i,j} \right)} \in \Omega} \\{{0\mspace{14mu}{if}\mspace{14mu}\left( {i,j} \right)\mspace{14mu}{not}} \in \Omega}\end{matrix} \right.$ where W represents a relationship between a pair of words, Ω represents indices of non-zero entries of sparse W, and ij are matrix iterations.
9. The computer of claim 8, comprising means for enforcing an l₁ regularization with mini-batch shrinking for every predetermined iterations in a stochastic gradient descent.